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ABSTRACT 

Motivation: The nucleosome is the basic repeating unit of chromatin. 
It contains two copies each of the four core histones H2A, H2B, H3 
and H4 and about 1 47 bp of DNA. The residues of the histone proteins 
are subject to numerous post-translational modifications, such as 
methylation or acetylation. Chromatin immunoprecipitiation followed 
by sequencing (ChlP-seq) is a technique that provides genome-wide 
occupancy data of these modified histone proteins, and it requires 
appropriate computational methods. 

Results: We present NucHunter, an algorithm that uses the data from 
ChlP-seq experiments directed against many histone modifications to 
infer positioned nucleosomes. NucHunter annotates each of these 
nucleosomes with the intensities of the histone modifications. We 
demonstrate that these annotations can be used to infer nucleosomal 
states with distinct correlations to underlying genomic features 
and chromatin-related processes, such as transcriptional start sites, 
enhancers, elongation by RNA polymerase II and chromatin-mediated 
repression. Thus, NucHunter is a versatile tool that can be used to 
predict positioned nucleosomes from a panel of histone modification 
ChlP-seq experiments and infer distinct histone modification patterns 
associated to different chromatin states. 

Availability: The software is available at http://epigen.molgen.mpg.de/ 
nuchunter/. 

Contact: chung@molgen.mpg.de 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 
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1 INTRODUCTION 

The genome of eukaryotes is packaged into a macromolecular 
structure called chromatin. The basic repeating unit of chromatin 
is the nucleosome, which contains two copies each of the four 
core histones H2A, H2B, H3 and H4, around which a 147-bp 
stretch of DNA is wrapped in a flat left-handed superhelix 
(Luger and Richmond, 1998). Nucleosomes form approximately 
every 200 bp along the genome to package the underlying DNA. 
Apart from the packaging function, nucleosomes may serve as a 
signaling module (Turner, 2012) that is integrated into biological 
processes acting with and on chromatin. This signaling function 
depends on post-translational modifications of the histone 
proteins, such as acetylation and methylation of lysine residues. 
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These histone modifications may serve as a binding platform for 
non-histone proteins, whose activities change chromatin struc- 
ture and function. 

When nucleosomes tend to form at the same or nearby gen- 
omic positions in different cells, they are called (well) positioned. 
Positioned nucleosomes are important for the hypothesis that 
nucleosomes constitute a signaling module, because gross move- 
ments of modified nucleosomes along the chromatin fibers may 
lead to a loss of coherence between the modifications and the 
genomic features and/or functions. 

The binding locations of modified histone proteins can be 
determined by a technique called chromatin immunoprecipitation 
followed by sequencing [ChlP-seq; Johnson et al. (2007)]. The 
immunoprecipitation step enriches for chromatin fragments con- 
taining a histone modification of interest, whereas the sequencing 
step is used to quantify the abundance of the underlying DNA. 

Because the core histone proteins are part of a stable protein- 
DNA complex, it is natural to assume that the localization of 
modified histone proteins corresponds to the position of the nu- 
cleosomes. This suggests that histone modification ChlP-seq 
data can be used to infer nucleosome positions. However, this 
is far from being a trivial task for a number of reasons: (i) histone 
binding does not seem to be as sequence- specific as for many 
transcription factors; (ii) nucleosome positions can change con- 
siderably with time and across cells; and (iii) the data are affected 
by sparse samphng and high noise. 

Nucleosome caUing algorithms, such as the one presented 
here, aim at detecting positioned nucleosomes. To obtain a com- 
prehensive and reliable set of predictions, one should combine 
the information contained in as many ChlP-seq experiments as 
possible and allow for some plasticity in the shape of the signal. 
However, modified histones tend to be mixed-source factors 
(Landt et al, 2012), which means that the degree of positioning 
can vary considerably across the genome. In regions where nu- 
cleosomes occupy different positions in different cells (e.g. within 
the body of actively transcribed genes), nucleosome calling algo- 
rithms are less suitable than segmentation approaches [Song and 
Smith (201 1); Zang et al. (2009), to mention a few], which aim at 
detecting domains of high nucleosome abundance. 

A number of tools for the inference of nucleosome positions 
have already been developed. Most of them apply signal process- 
ing techniques, such as Fourier transforms (Flores and Orozco, 
2011), wavelet decomposition (Zhang et al, 2008a) and ad hoc 
filters (Albert et al, 2007; Weiner et al, 2010), to smooth the 
enrichment profile, followed by the detection of local maxima. 
Others are based on Bayesian modeling of the nucleosome 
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enrichment pattern (Zhang et al., 2012). These methods do not 
aUow one to control for systematic biases by comparing the nu- 
cleosome calls with data from control experiments. Furthermore, 
they cannot integrate data from multiple histone marks in a 
straightforward manner. Finally, because of the large size of 
the problem, e.g. the human genome, and the potentially high 
number of histone modifications, the runtime and memory con- 
sumption of these tools may limit their applicability. 

Our tool can use information from a control sample to correct 
for systematic biases inherent in this high-throughput technol- 
ogy. It is designed to integrate multiple histone marks to broaden 
the range of nucleosome positions that can be detected. It anno- 
tates each identified nucleosome with the contributing histone 
modifications. We will demonstrate that these annotations can 
be used to cluster nucleosomes by their histone modification 
patterns. This clustering yields patterns of modifications that 
can be correlated to the function of the chromatin, such as tran- 
scriptional start sites and enhancers, or to the underlying process, 
such as transcriptional elongation by RNA polymerase II. These 
results support the assumption that nucleosomes serve as signal 
modules for biological process and that the corresponding his- 
tone modification patterns are a reflection of the signaling taking 
place on these modules. 



2 METHODS 

The algorithm performs three major steps: (i) a preprocessing step, where 
each file containing the chromosomal positions of mapped reads is turned 
into a numerical signal, (ii) a shape detection step, where candidate pos- 
itions for nucleosome formation sites are detected and (iii) a filtering step, 
where these candidates are filtered and scored accounting for a number of 
possible sources of bias. In the following, we will refer to the enrichment 
profile on the positive or negative strand as the signal that counts for each 
location the number of positive or negative reads whose 5' end maps 
there, and they will be denoted P(p) and N(p), respectively, where p is 
the chromosomal position. 

2.1 Preprocessing 

A well-positioned nucleosome typically exhibits the enrichment profile 
shown in Figure 1 : a peak of positive strand reads upstream of the nucleo- 
some location, and one of negative strand reads downstream. To obtain a 
consensus signal, which will be called the input signal /, the enrichment 
profile on the positive strand P is shifted to the right, the one on the 
negative strand TV is shifted to the left and the sum of the two is considered. 
Denoting with F the average length of a fragment in the DNA library, the 
amount of this shift is about F/2, which yields the input signal: 

I(p) = P(p-F/2) + N(p + F/2). 

In case of single-end sequencing data, usually the average fragment length 
needs to be estimated from the data itself. This estimation can be carried 
out by several available tools [such as Zhang et al. (2008b)]. However, 
because of the mixed source nature of the data, we found the available 
methods unsatisfactory when applied to histone marks, and therefore, as 
part of NucHunter, we also provide a method for estimating the average 
fragment length (described in Section 2.4). 

2.2 Peak detection 

In the peak detection step (see Fig. 2), a suitable filter is applied to the 
input signal, followed by the detection of local maxima in the filtered 
signal and the analysis of the statistical significance of these maxima. 
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Fig. 1. Preprocessing: from mapped reads to consensus signal. Positive 
and negative reads generate a strand specific enrichment profile which 
counts at each position the amount of reads whose 5' end maps there. The 
consensus signal is obtained by shifting the strand specific enrichment 
profiles F/l bases downstream, where F is the average fragment size, 
and summing them up 




Fig. 2. Peak detection from the consensus input signal. The input signal is 
smoothed using a filter with a certain impulse response, then the maxima 
of the resulting signal are detected and non- significant local maxima are 
filtered out 



A filter (more precisely a linear time-invariant filter) is characterized by 
a discrete signal K(p) called impulse response. Given an input signal I(p), 
the filter output 0(p) is the result of the following operation, called 
convolution: 

0(p) = (IxK)(j,) = J2 1(P -J WU), 

J 

where the index j ranges over positions where K(j) is not 0. 

The impulse response in our approach has been chosen according to 
the following two criteria: first, it must separate sharp peaks from more 
spread out read distributions or non-enriched regions; second, it must 
have good smoothing properties, so that the convoluted signal contains a 
limited number of local maxima (Rice, 1944) and, therefore, the algo- 
rithm returns fewer false positives. We chose as impulse response the 
second derivative of a Gaussian density function, also known as the 
Mexican hat wavelet (see Fig. 3): 



The Mexican hat wavelet removes from the Fourier spectrum of the input 
signal both high- and low-frequency components (band-pass filter), which 
is appropriate if we interpret high frequencies as random oscillations 
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because of noise or insufficient coverage, and low frequencies as broad 
ambiguous peaks coming from a mixture of nucleosome positions, or as 
local biases such as GC content or open chromatin. The wavelet is para- 
metrized by the scale parameter a. In our studies, we chose a default value 
of 50 for a because, in general, it corresponds to a good compromise 
between calling too many peaks and merging closely spaced ones. The 
parameter can also be fitted to the dataset under consideration using the 
method outlined in Section 2.4. 

Obtaining the convoluted signal for large genomes poses computa- 
tional problems. In fact, a long signal as impulse response results in 
a slow convolution operation. In NucHunter the convolution has 
been implemented using recursive filters, an efficient signal-processing 
technique (Hale, 2006). 

Once local maxima are extracted from the filter output, their statistical 
significance is assessed. To this end, we model the noise by assuming that 
values of the input signal within a certain region are independent identi- 
cally distributed random variables (rvs). Using this assumption, we derive 
the mean and standard deviation of the convoluted signal, and we assign 
a z-score to each local maximum. If I(p) denotes the input signal, K(p) the 
impulse response and 0(p) the convoluted signal at position p, let m(p) 
and std(p) denote, respectively, the mean and standard deviation of the 
input signal in a large region R that contains position p, then the z-score 
is given by: 



m(p) : 



std(p) -- 



z-score(/?) : 



E 

0(P)- 



(I(k)-m(p)y 



\R\-\ 



The detected peaks are all those local maxima with a z-score above a 
certain threshold. This z-score represents the strength of a peak, and a 
user-defined threshold, whose default value is 3, specifies how many 
standard deviations above average the peaks' strength must be. 

Additionally, the peaks are assigned a fuzziness score that represents 
the degree of uncertainty about the peak position, given by the formula: 



fuzziness (p) - 
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where 0^^(p) denotes the second discrete derivative of the filter output. 
2.3 Filtering and scoring 

After a set of putative peaks has been derived, additional filtering steps 
are carried out when a control sample is available. They are all based on 
the enrichment level of a peak, defined as the total number of reads that 



contribute to the input signal in a window of a certain size (by default 
147 bp) centered around the peak. 

Peaks are filtered in a similar manner as in Zhang et al. (2008b): the 
enrichment level is modeled as a Poisson rv whose parameter is estimated 
from both a global and a local average of the control sample, which is 
rescaled so that the number of sequenced reads in the two samples 
matches. From this model a P- value is obtained, and peaks can be filtered 
based on a user-defined threshold (which defaults to 10~^). 

In more detail, let G denote the genome length. The noise level in 
the control sample is rescaled according to the total coverage ratio 

a = ^''"^^^^^ so that the local and global noise estimates Xw and Xtot 

can be expressed, respectively, as a ^^2w+\^^^^ ^ z^/c=gi ^^^^ ^ ^j^g 
final noise estimate X is chosen as the maximum between the two 
{W defaults to 1000). Finally, the null model for the read counts in a 
window of a fixed radius R is given by: 

F 

J2 I(P + k)^ Poiss{(2F+ 

k=-F 

The next filtering step consists in controlling the relative amount of posi- 
tive and negative reads in the enrichment level, as highly unbalanced 
contributions from the two strands are likely to arise from mapping 
biases. Following the approach of Zhang et al. (2008a), we filter out 
peaks where the ratio between the two contributions is not contained 
in the interval [r, 1/r], where r defaults to 4. 

A final step takes place when the sample is obtained from multiple 
ChlP-seq experiments. In this case, the consensus input signals from the 
different samples are added together, and the above steps are carried out 
as if the signal came from a single experiment. After that, however, the 
enrichment level at each peak is decomposed into the contributions of the 
different experiments, and each of them is tested independently to assess 
whether a certain histone modification is present or not (see Fig. 4). The 
tests are carried out using the same noise model and formulas shown 
above, where now / corresponds to the consensus signal derived from 
the single histone modifications. 

Finally, for each nucleosome call the algorithm provides, along with 
the genomic coordinates, the following statistics: (i) the z-score (peak 
strength), (ii) the input signal enrichment level (sum of the raw read 
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Fig. 4. Integration of multiple histone modification experiments. First, 
peak detection is performed on the sum of the input signals, then the 
signal is decomposed into the contributions of the single histone modifi- 
cations and then a statistical test is performed for each of them to asses 
whether their contribution is significant or not 
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counts in a window of 147 bp around the peak), (iii) the control signal 
enrichment level (sum of the smoothed read counts in the same window), 
(iv) a P-value derived from the comparison between input and control 
enrichment levels (significance of the enrichment) and (v) the fuzziness 
score for the peak position. In case multiple samples are simultaneously 
analyzed, it is also provided, for each input sample and each nucleosome 
call, the contribution to the total enrichment level in terms of raw reads 
and the result of the statistical test as an on/off flag. 

2.4 Inferring the average fragment length 

The average fragment length F is typically inferred based on the strand 
cross-correlation function, defined as: 

CC(/c) = ^P(p)7V(p + /c), 

p 

where the index p spans all genomic positions. For point-source factors 
and low noise levels the cross-correlation function usually has a peak at 
position F (the 'fragment peak'), as shown in Figure 5, which yields a 
straightforward method for the estimation of F. However, for many his- 
tone marks the cross-correlation plot is harder to interpret because of the 
presence of a so-called phantom-peak (Landt et al., 2012) and other sys- 
tematic biases, which can sometimes completely obscure the fragment 
peak (see Fig. 5b). To account for these biases, we introduce a modified 
cross-correlation function that we call peak cross-correlation (pec): 

p 

The signals N and P are a dense representation of the peaks obtained 
applying a peak detection algorithm to the strand-specific signals TV and 
P. More specifically, N and P are binary signals whose only non-zero 
entries are ones occurring at the peaks' locations. The peak detection 
technique presented in Section 2.2 applied to the consensus signal / is 
applied to the signals and P. 

The pec function, which is used to infer F, can also indicate how ap- 
propriate the choice of cr is, where cr is the parameter used for peak 
detection (see Section 2.2). If and P are assumed to be two replicates 
of the same signal with a systematic shift of F base pairs in the nucleo- 
some peaks, a good choice of a should result in a strong peak in the pec 
function around position F, whereas a bad choice should lead to almost 
independent peaks in the two strands and a flat pec function. 

After the pec function is computed, a clustering technique is applied to 
interpret it, which yields an estimate for F and a quality score for cr. We 
assume that the plot is generated by sampling from a mixture of three rvs: 
a uniform rv to model the background noise, a Gaussian rv to model the 
phantom peak and another Gaussian rv to model the fragment peak, as 
shown in Figure 5c. The parameters of these distributions are inferred 
using an expectation-maximization algorithm, and the mean of the 
Gaussian rv corresponding to the fragment peak is used as an estimate 
for the average fragment length. The quality score, which is derived from 
the likelihood of the inferred model, can be computed for different values 
of a and yields a score curve (see Section 1 in Supplementary Material for 
more details). 

3 RESULTS 

3.1 Comparison to other available tools 

We have developed NucHunter to identify nucleosome positions 
using histone modification ChlP-seq data. To test the predictive 
power of our algorithm and to compare it with other available 
tools, we ran NucHunter and two other tools [Nucleosome 
Positioning from Sequencing (NPS) from Zhang et al. (2008a); 
Template Filter from Weiner et al. (2010)] on a H3K9ac dataset 
from yeast (Weinberger et al, 2012). Some tools had to be 
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Fig. 5. Strand cross-correlation analysis for some ChlP-seq experiments 
in human K562 cells, (a) Histone modification H3K4me3, a point-source 
or mixed-source factor. The phantom peak and the fragment peak are 
clear, (b) Histone modification H3K9me3, a broad-source factor. The 
fragment peak is almost not visible, in contrast with the phantom peak, 
(c) The pec for the histone modification H3K9me3. Now also the frag- 
ment peak is visible, and it is possible to infer the average fragment length 
with an EM algorithm 
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excluded from the comparison either because they were not able 
to deal with the large amount of data or because the results 
obtained using default parameters were unsatisfactory. We 
chose yeast because we wanted to compare the predictions to a 
base pair resolution map of nucleosome positions in yeast 
(Brogaard et al., 2012). This map has been obtained with a tech- 
nique that, even if it has not been tested widely yet, is independ- 
ent from ChlP-seq, and it is claimed to be more accurate. 

In line with previous studies (Chung and Vingron, 2009), to 
compare the nucleosome predictions with the nucleosome map, 
we used the (normalized) area under the cumulative error curve 
(AUC) as a performance measure. The AUC was obtained apply- 
ing the following procedure (see also Supplementary Material): 

(1) we consider the set of all distances between nucleosome 
predictions and nucleosomes in the map <73 bp, 

(2) we obtain a cumulative error curve. In such a curve, a 
point (x,y) means that a fraction y of the distances is 
less than x base pairs (see Supplementary Fig. S5), 

(3) we compute the AUC and we normalize it, so that a set of 
perfect predictions has an AUC of 1 and a random set of 
genomic positions has an expected AUC of 0.5. 

Along with the AUC, we also computed the sensitivity and the 
specificity, defined, respectively, as the fraction of nucleosomes in 
the map that are closer than 20 bp to a nucleosome prediction 
and the fraction of nucleosome predictions that are closer than 
20 bp to a nucleosome in the map. Moreover, to account for the 
great variability in the number of predictions returned by each 
tool, we repeated the performance measurements for different 
score thresholds. 

The results from Figure 6 show that NucHunter makes more 
accurate predictions compared with the other tools. Considering 
the default score thresholds, NucHunter and NPS return a similar 
number of predictions but the former has an higher AUC than the 
second, whereas Template Filter returns many more predictions 



and of lower quality. When the score threshold is increased, the 
AUC difference between NucHunter and NPS becomes much 
more pronounced. This suggests that the nucleosome predictions 
with highest score from NucHunter are, in general, much more 
precise compared with those from the other tools. All the tools 
suffer from low sensitivity in this dataset, in particular NucHunter 
and NPS when the default score thresholds are used. 

The reasons for unidentified nucleosomes or incorrect predic- 
tions can be many. In the first place, the experimental procedures 
used for the ChlP-seq experiment and that used for the nucleo- 
some map are different. Roughly 5.6% of the nucleosomes in 
the map, for instance, are located in low-mappability regions 
and are not covered by any read. Moreover, the ChlP-seq 
experiment targeted only acetylated nucleosomes, as opposed 
to the nucleosome map. A more general problem is the identifi- 
cation of fuzzily positioned nucleosomes. If the nucleosome 
positioning varies extensively from cell to cell, the assumptions 
made by the algorithms are violated and nucleosomes are hard 
to identify. Lastly, both specificity and sensitivity are affected 
from high noise levels, insufficient sequencing coverage and 
sequencing biases. 

In addition to the yeast dataset, we also tested the algorithms 
on a simulated dataset and on different histone modification 
ChlP-seq files in human K562 cells. In the simulated dataset, 
the nucleosome map is randomly generated and the reads are 
generated accordingly. Because there is no nucleosome map 
for the human dataset, we used pairs of replicate experiments 
and pairs of different histone modifications as gold standard-pre- 
dictions pairs. The details of the simulation and the performance 
evaluations are reported in Supplementary Material in Sections 3 
and 4. In general, the results are in agreement with those 
shown previously. In the Supplementary Material, it is also 
shown that NucHunter runs faster and requires less memory 
than the other two algorithms (see Supplementary Material 
Section 5). 
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Fig. 6. Accuracy assessment of different tools on the yeast dataset. The performance measures (AUC, sensitivity and specificity) are computed for every 
possible score threshold, which results in an AUC number of calls curve (left) and a specificity-sensitivity curve (right). The circles indicate the 
performance of the algorithms using the default thresholds 
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3.2 Clustering of nucleosomes based on histone marks 

We ran NucHunter on a composite dataset from a human leu- 
kemia cell line [K562, Myers et al. (2011)] consisting of a control 
experiment and 12 ChlP-seq experiments for different histone 
modifications. For each detected nucleosome and for each ex- 
periment the algorithm returned, along with other statistics, the 
raw read count within a window of a specified width (which 
defaults to 147) around the inferred nucleosome location (see 
Methods). 

We used these read counts for an exploratory analysis of the 
chromatin landscape. After a normalization procedure that cor- 
rects the read counts taking into account the control sample, the 
different sequencing depths of the dataset s and the nucleosome 
abundance at each locus, we obtained a joint histone modifica- 
tion level distribution and we applied the k-means clustering al- 
gorithm on it (see Supplementary Material for more details). 
Given a parameter k, this unsupervised learning method aims 
at partitioning the data points into k different families (clusters) 
such that elements in the same cluster are as similar to each other 
as possible. Each cluster is characterized by its centroid, which is, 
in our case, a prototypical histone modification pattern. 

We found that with k equals 6 the results are robust, whereas 
for higher values of the parameter the clusters tend to change 
depending on the initialization (see Supplementary Material). 
Moreover and most importantly, we found that such a partition- 
ing, derived solely from the histone modification patterns, 
can also capture biologically meaningful positional features 
of the nucleosomes. We assigned labels to each cluster based 
on the histone modification pattern and genomic localization. 
The labeled centroids are shown in Figure 7. 

We studied the genomic localization of nucleosomes from 
the different clusters using the RefSeq annotation dataset as 
well as publicly available data from cap analysis of gene expres- 
sion (CAGE) and DNase I hypersensitivity sequencing experi- 
ments (Myers et al., 201 1). We performed the following analyses 
(further discussed in Supplementary Material): (i) we derived a 
consensus nucleosome profile along genes by considering a large 
set of annotated genes, by rescaling their nucleosome profiles to 
the same length and by adding them up (Fig. 8a); (ii) we analyzed 
the nucleosome positioning around promoters of active genes by 
considering the distribution of distances between CAGE tags 
and nucleosomes (Fig. 8b); and (iii) we obtained the average 
DNase I hypersensitivity profile around nucleosomes for each 
class (Fig. 8c). 

Overall these data give a clear picture of the nucleosome land- 
scape and recapitulate previous knowledge (see Fig. 7). The nu- 
cleosomes in the first family are characterized by a strong 
enrichment of H3K4me2/3 and H3K9ac, and they tend to 
reside in the 5' portion of a gene near the transcriptional start 
site (TSS; Fig. 8a). Thus, we labeled them 'promoter' nucleo- 
somes. In proximity of promoters of active genes, these nucleo- 
somes exhibit a strikingly regular pattern (Fig. 8b), whose main 
features are a nucleosome-depleted region right upstream the 
TSS and a well-positioned nucleosome 170 bp downstream (the 
+ 1 nucleosome). The second and third clusters show an enrich- 
ment of H3K4mel and H2AZ as well as a general enrichment of 
active marks, whereas TSS-associated histone marks, such as 
H3K4me2/3 and H3K9ac, are less enriched compared with the 
promoter cluster. These features, together with the high levels of 
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Fig. 7. Using the /r-means algorithm, 422 547 nucleosomes called by 
NucHunter were clustered into six clusters: promoter (20.4%), enhancer 
1 (19.8%), enhancer 2 (14.4%), elongation early (16.4%), elongation late 
(14.7%) and repressed (14.3%). The rows of the heatmap represent the 
centroids of the clusters and the columns represent the histone modifica- 
tions. The labels have been assigned based on prior biological knowledge 



DNase I hypersensitivity that we observe (Fig. 8c), suggest that 
these nucleosomes may flank enhancer sequences. Thus, we 
labeled them as 'enhancer 2' and 'enhancer 1' nucleosomes. 
The fourth centroid is enriched in H3K79me2 and H4K20mel, 
whereas the fifth centroid is enriched in H3K36me3 and 
H3K9mel, which are all histone marks related to elongation of 
RNA polymerase II (Vavouri and Lehner, 2012). The localiza- 
tion of these two classes of elongation nucleosomes along the 
gene body, shown in Figure 8a, suggests that the 5th centroid 
is enriched toward the 3' end of a gene, whereas the 4th centroid 
is enriched more to the 5' end. Thus, we termed them 'elongation 
early' and 'elongation late' nucleosomes, respectively. The last 
centroid is characterized by an enrichment of H3K9me3 and 
H3K27me3, suggesting that it represents chromatin-repressed 
genomic regions (Margueron and Reinberg, 2011). Thus, we 
termed it 'repressed'. 

Lastly, we explored the relation between the different clusters 
that we obtained and a previously pubhshed study (Ernst and 
Kelhs, 2010) that aimed at classifying the chromatin landscape 
into discrete states. Even though the last method uses a more 
complex model, different data sources and positional relations 
between histone modification patterns, we found that the overall 
results are comparable (see Supplementary Material). We beheve 
that the joint analysis of histone modification patterns and 
nucleosome positioning that NucHunter allows for provides 
complementary information and offers a greater potential than 
histone modification studies based on arbitrary binning schemes 
of the genome. 
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Fig. 8. Genomic localization of the different nucleosome classes in human K562 cells, (a) Occupancy of nucleosomes from the different classes along the 
gene body. The nucleosome occupancy profiles from a subset of genes in RefSeq have been rescaled to the same length and summed up. (b) Nucleosome 
distribution at promoters of active genes. The profile has been obtained by computing the distribution of distances between CAGE tags and nucleo- 
somes. (c) DNase I hypersensitivity levels in relation to nucleosomes. The profile for each nucleosome class is the average DNase I hypersensitivity 
profile of all nucleosomes from that class 



4 DISCUSSION 

The fast-paced development of chromatin immunoprecipitation- 
based techniques is heading toward an increased spatial reso- 
lution for DNA-protein interactions. In line with this trend, 
we developed NucHunter, a software for base pair resolution 
nucleosome identification in ChlP-seq experiments. The innova- 
tive aspects of this tool reside in a more accurate and efficient 
signal processing, an improved statistical analysis of the peaks, 
the possibihty of integrating data from a control sample and to 
consider multiple histone modifications at once. 

We put forward a nucleosome-centric view, because if we view 
the modifications (either sequentially or in a combinatorial pat- 
tern) as a reflection of a signaling activity then nucleosomes can 



be viewed as 'signaling modules' (Turner, 2012). In agreement 
with this idea, we found that nucleosomes can be clustered into 
distinct subgroups. These subgroups either mark certain func- 
tional regions of the genome, such as promoters and enhancers, 
or are related to biological processes, such as elongation or chro- 
matin-mediated repression. Although this is not a new finding 
[see Ernst and Kelhs (2010)], we think that our approach has the 
benefit of assigning the data to a physical entity that carries 
the information: the nucleosome. Thus, separation of different 
histone modification patterns into distinct subgroups becomes 
much more meaningful than by arbitrarily binning the genome 
into non-overlapping windows (Ernst and Kellis, 2010), where 
two nucleosomes with different modification patterns could 
be present. 
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In summary, we developed a new tool called NucHunter that is 
able to identify positioned nucleosomes along the genome using 
ChlP-seq data of histone modifications and annotates each nu- 
cleosome with (i) a flag indicating presence or absence of a certain 
histone modification and (ii) the number of contributing reads (if 
one is interested in a more quantitative view). We demonstrated 
that NucHunter performs better than currently available tools 
and has some features not present in any of them. By focusing 
on the nucleosome as information carrier, charting the epigenome 
will become much more meaningful and will in the long run allow 
for unraveling novel chromatin-mediated mechanisms. 
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